Use load+store instead of memcpy for small integer arrays #111999
Conversation
r? @b-naber (rustbot has picked a reviewer for you, use r? to override)
Force-pushed from 1310aed to 44c45ca
@bors try @rust-timer queue
⌛ Trying commit 44c45ca0492db8f40cdcb19ff7dbfca05e95be75 with merge daa8f4a9835d98cf1139c088d10b29b4e8e084a5...
☀️ Try build successful - checks-actions
Finished benchmarking commit (daa8f4a9835d98cf1139c088d10b29b4e8e084a5): comparison URL.
Overall result: ❌ regressions - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): a less reliable metric that may be of interest but was not used to determine the overall result.
Cycles: a less reliable metric that may be of interest but was not used to determine the overall result.
Binary size: a less reliable metric that may be of interest but was not used to determine the overall result.
Bootstrap: 643.928s -> 645.194s (0.20%)
r? compiler
if flags == MemFlags::empty()
    && let Some(bty) = bx.cx().scalar_copy_backend_type(layout)
{
    // I look forward to only supporting opaque pointers
Maybe a FIXME so someone may remove this when llvm only has opaque ptrs? (Well, I guess removing this logic would also preclude other backends with typed ptrs, too. In that case, maybe no comment at all.)
Yeah, it's a bigger question because GCC would need to move, so I'll leave it tracked by conversations like https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/llvm.20bitcasts.20in.20codegen/near/356591425 instead of something specific in this place.
///
/// This *should* return `None` for particularly-large types, where leaving
/// the `memcpy` may well be important to avoid code size explosion.
fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
Having a default impl here may make it not obvious that other backends should emit typed load/store. Is it bad style to just add this to cranelift and codegen_gcc here too?
Well, cg_clif doesn't use cg_ssa, so it's not impacted.
I have no practical way to test cg_gcc, being on Windows, and I also don't have any information that doing something other than memcpy would actually be an improvement for it, so I figured I'd just leave the default here, since it's a semantically sufficient implementation.
Notably, GCC apparently doesn't have the poison semantics that nikic mentioned as being the problem for better optimizing this:
rust/compiler/rustc_codegen_gcc/src/common.rs, lines 76 to 79 in f8447b9:
fn const_poison(&self, typ: Type<'gcc>) -> RValue<'gcc> {
    // No distinction between undef and poison.
    self.const_undef(typ)
}
so indeed it might just never need to do this.
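To make the design point about the default implementation concrete, here is a standalone toy sketch of the default-vs-override pattern being discussed. The trait, the stand-in backend types, and the size cutoff are our own illustrative assumptions, not rustc's actual LayoutTypeMethods code.

trait ScalarCopySketch {
    type BackendType;

    // Default: no special scalar-copy type, so callers keep using memcpy.
    fn scalar_copy_backend_type(&self, _size_bytes: u64) -> Option<Self::BackendType> {
        None
    }
}

// A backend that just takes the default stays on the memcpy path.
struct MemcpyOnlyBackend;
impl ScalarCopySketch for MemcpyOnlyBackend {
    type BackendType = String;
}

// A backend can override the method to opt in to typed load/store for
// small copies, mirroring the "return None for particularly-large types"
// guidance in the doc comment above.
struct VectorCopyBackend;
impl ScalarCopySketch for VectorCopyBackend {
    type BackendType = String; // stand-in for a real backend type handle

    fn scalar_copy_backend_type(&self, size_bytes: u64) -> Option<Self::BackendType> {
        (size_bytes <= 16).then(|| format!("<{size_bytes} x i8>"))
    }
}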
let src = bx.pointercast(src, pty);
let dst = bx.pointercast(dst, pty);
When do these not match pty? Why not just return a more "accurate" type in scalar_copy_llvm_type? Like the actual LLVM type that corresponds to src/dst? (Or am I misunderstanding something?)
(Well, the shallow answer is that they don't match because it's -> Option<Type>, not should_load_store_instead_of_memcpy() -> bool.)
Because emitting these as the backend type doesn't necessarily do what we want. Notably, this PR is emitting the load/store as <4 x i8>, the vector type, rather than the llvm_backend_type of [4 x i8] for arrays.
I tried using the LLVM type first, but with arrays that results in exploding out the IR: https://llvm.godbolt.org/z/vjjsdea9e
Optimizing the version that loads/stores arrays
define void @replace_short_array_using_arrays(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
%1 = load [3 x i32], ptr %r, align 4
store [3 x i32] %1, ptr %0, align 4
%2 = load [3 x i32], ptr %v, align 4
store [3 x i32] %2, ptr %r, align 4
ret void
}
gives
define void @replace_short_array_using_arrays(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
%.unpack = load i32, ptr %r, align 4
%.elt1 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 1
%.unpack2 = load i32, ptr %.elt1, align 4
%.elt3 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 2
%.unpack4 = load i32, ptr %.elt3, align 4
store i32 %.unpack, ptr %0, align 4
%.repack5 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 1
store i32 %.unpack2, ptr %.repack5, align 4
%.repack7 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 2
store i32 %.unpack4, ptr %.repack7, align 4
%.unpack9 = load i32, ptr %v, align 4
%.elt10 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 1
%.unpack11 = load i32, ptr %.elt10, align 4
%.elt12 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 2
%.unpack13 = load i32, ptr %.elt12, align 4
store i32 %.unpack9, ptr %r, align 4
store i32 %.unpack11, ptr %.elt1, align 4
store i32 %.unpack13, ptr %.elt3, align 4
ret void
}
whereas optimizing the version with vectors leaves the operations together
define void @replace_short_array_using_vectors(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
%1 = load <3 x i32>, ptr %r, align 4
store <3 x i32> %1, ptr %0, align 4
%2 = load <3 x i32>, ptr %v, align 4
store <3 x i32> %2, ptr %r, align 4
ret void
}
for the backend to legalize instead.
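For context, a minimal Rust function of the shape these IR examples correspond to. This is a stand-in we wrote to match the replace_short_array_* signatures above (a 12-byte [i32; 3] replaced by value), not code copied from the godbolt link.

use std::mem;

// Replacing a small [i32; 3]: the 12-byte copy that the IR above loads and
// stores as a single typed value instead of lowering to a memcpy call.
pub fn replace_short_array(r: &mut [i32; 3], v: [i32; 3]) -> [i32; 3] {
    mem::replace(r, v)
}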
Oh, and I originally expected to do this with LLVM's arbitrary-length integer support https://llvm.godbolt.org/z/hzG9aeqhM, like I did for comparisons back in #85828
define void @replace_short_array_using_long_integer(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
%1 = load i96, ptr %r, align 4
store i96 %1, ptr %0, align 4
%2 = load i96, ptr %v, align 4
store i96 %2, ptr %r, align 4
ret void
}
But that broke some of the autovectorization codegen tests for operations on arrays.
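For illustration, a hypothetical example we added (not one of the actual codegen tests) of the kind of array operation whose autovectorization those tests check: element-wise arithmetic over a small fixed-size array.

// Element-wise addition over a small array; LLVM is expected to
// autovectorize the loop, which is what the codegen tests assert.
pub fn add_arrays(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}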
Yeah, OK, so we explicitly want to lower these moves to vector types, specifically because they optimize better than arrays.
To confirm we're not just helping `mem::swap`
Force-pushed from 44c45ca to e1b020d
Finished benchmarking commit (fd9bf59): comparison URL.
Overall result: ❌✅ regressions and improvements - ACTION NEEDED
Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Instruction count: a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): a less reliable metric that may be of interest but was not used to determine the overall result.
Cycles: a less reliable metric that may be of interest but was not used to determine the overall result.
Binary size: a less reliable metric that may be of interest but was not used to determine the overall result.
Bootstrap: 647.471s -> 646.658s (-0.13%)
@rustbot label: +perf-regression-triaged
I noticed a quite large performance regression (15-20%) in an analytical database I'm building that uses [u8; 24] and [u8; 28] arrays. I bisected the regression to between nightly commits e6d4725 and b2b34bd. This PR was the obvious suspect, since it changed how arrays that small are copied around, and there are a lot of small arrays being moved around in the database I'm building. I didn't notice any performance regressions for other data types (u128 or u64, etc.). I've only tested on aarch64-apple-darwin so far. I haven't published my work yet, but I can come up with a benchmark program if needed.
(In case you don't already know about it: cargo-bisect-rustc should bisect within the rollup PR of that last commit, to hopefully give you the offending PR, or to validate or rule out that it's this one.)
I'll verify the offending PR using cargo-bisect-rustc. In the meantime, here's a simple benchmark which reproduces the issue on my MacBook M1 Pro:
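The reporter's actual benchmark program is not preserved in this thread. As a stand-in, here is a hypothetical sketch of the kind of workload described (moving [u8; 24] values by value in a hot loop); the sizes, loop counts, and structure are our own assumptions, not the original benchmark.

use std::hint::black_box;
use std::time::Instant;

// Hypothetical reproduction sketch: shuffle small [u8; 24] arrays around
// by value in a hot loop, the copy pattern the report says regressed.
fn main() {
    let mut buf = [[0u8; 24]; 1024];
    let start = Instant::now();
    for i in 0..50_000_000usize {
        let idx = i % buf.len();
        let v = black_box(buf[idx]);        // by-value copy out
        buf[(idx + 7) % buf.len()] = v;     // by-value copy back in
    }
    black_box(&buf);
    println!("elapsed: {:?}", start.elapsed());
}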
Please open an issue so we can track this. Closed PRs aren't a good place for it.
Will do.
Remove my `scalar_copy_backend_type` optimization attempt
I added this back in rust-lang#111999, but I no longer think it's a good idea:
- It had to get scaled back to only power-of-two things to not break a bunch of targets
- LLVM seems to be getting better at memcpy removal anyway
- Introducing vector instructions has seemed to sometimes (rust-lang#115515 (comment)) make autovectorization worse
So this removes it from the codegen crates entirely, and instead just tries to use <https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/builder/trait.BuilderMethods.html#method.typed_place_copy> instead of direct `memcpy`, so things will still use load/store when a type isn't `OperandValue::Ref`.
I was inspired by #98892 to see whether, rather than making mem::swap do something smart in the library, we could update MIR assignments like *_1 = *_2 to do something smarter than memcpy for sufficiently-small types, where doing it inline is going to be better than a memcpy call in assembly anyway. After all, special code may help mem::swap, but if the "obvious" MIR can just result in the correct thing, that helps everything -- other code like mem::replace, people doing it manually, and just passing around by value in general -- as well as making MIR inlining happier, since it doesn't need to deal with all the complicated library code if it just sees a couple of assignments.
LLVM will turn the short, known-length memcpys into direct instructions in the backend, but that's too late for it to be able to remove allocas. In general, replacing memcpys with typed instructions is hard in the middle-end -- even for memcpy.inline, where it knows it won't be a function call -- due to poison propagation issues. So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to mem2reg away the allocas in some situations.
#52051 previously did something like this in the library for mem::swap, but it ended up regressing when MIR inlining was enabled (cbbf06b), so this has been suboptimal on stable for ≈5 releases now.
The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the LayoutTypeMethods trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize String would be obvious candidates for a follow-up.)
Codegen demonstrations: https://llvm.godbolt.org/z/fK8hT9aqv
Before:
Note it going to the stack:
Now:
Still lowers to dword+word operations, but has no stack traffic:
And as a demonstration that this isn't just mem::swap, a mem::replace on a small array (since replace doesn't use swap, since #83022), which used to be memcpys in LLVM, changes in IR but still lowers to reasonable dword+qword instructions.
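For reference, a minimal Rust sketch of the by-value patterns the description says benefit from typed copies instead of memcpy: the plain *_1 = *_2 style assignment and mem::swap on a small array. The function names and the [u32; 3] type are our own illustrative choices, not taken from the linked godbolt demonstrations.

use std::mem;

// Plain by-value copy: the `*_1 = *_2`-style MIR assignment discussed above.
pub fn copy_short_array(dst: &mut [u32; 3], src: &[u32; 3]) {
    *dst = *src;
}

// mem::swap on the same small array type, the case #98892 was about.
pub fn swap_short_arrays(a: &mut [u32; 3], b: &mut [u32; 3]) {
    mem::swap(a, b);
}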